{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "T4"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Learn OpenAI Whisper - Chapter 5\n",
        "## Notebook 1: Setting up Whisper for transcription\n",
        "\n",
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1j1hoGvoOfZkh_p8dKceC4lS895YZKs9k)\n",
        "\n",
        "This notebook provides a guide for setting up and using OpenAI's Whisper model for audio transcription tasks.\n",
        "\n",
        "I hope you find this guide both informative and practical, setting you up for success in your audio transcription endeavors. If you have any questions or need further clarification on any of the steps, feel free to reach out.\n",
        "\n",
        "Here's a high-level summary of its contents, along with suggested titles for code sections:\n",
        "\n",
        "1. **Setting up the environment and installing dependencies**\n",
        "2. **Downloading audio files**\n",
        "3. **Verifying of compute resources**\n",
        "4. **Loading the Whisper model**\n",
        "5. **Setting the Natural Language Toolkit (NLTK) library**\n",
        "6. **Transcribing and language detection**\n",
        "7. **Defining a transcription function**\n",
        "8. **Transcribing non-English audio with optional translation**\n",
        "9. **Enhancing transcription with custom options**\n",
        "   \n",
        "Through these steps, I aim to equip you with a comprehensive understanding of Whisper's transcription and translation capabilities. By applying these techniques, you'll be well-prepared to tackle a wide range of audio processing tasks, from simple transcriptions to more complex, multilingual projects."
      ],
      "metadata": {
        "id": "jtwssRT-Xi-K"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "1. **Setting up the environment and installing dependencies**\n",
        "   \n",
        "   This section contains commands to install necessary dependencies and packages such as ffmpeg for handling audio files, Python libraries for machine learning (like cohere, openai, tiktoken), and the Whisper library directly from its GitHub repository."
      ],
      "metadata": {
        "id": "cqAR3wcTasbS"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KVHm8JGYOH2U"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!sudo apt install ffmpeg\n",
        "!pip install -q cohere openai tiktoken\n",
        "!pip install -q git+https://github.com/openai/whisper.git"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "2. **Downloading audio files**\n",
        "   \n",
        "    Here, we download sample audio files from a GitHub repository and the OpenAI's content delivery network or CDN. It includes both English and Spanish audio samples, preparing the ground for transcription tasks.\n"
      ],
      "metadata": {
        "id": "fEW6eW1wa4Xh"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Sample_Audio01.mp3\n",
        "!wget -nv https://github.com/PacktPublishing/Learn-OpenAI-Whisper/raw/main/Chapter01/Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3\n",
        "!wget -nv https://cdn.openai.com/API/examples/data/bbq_plans.wav\n",
        "!wget -nv https://cdn.openai.com/API/examples/data/product_names.wav\n",
        "\n",
        "audiofiles = ['Learn_OAI_Whisper_Sample_Audio01.mp3', 'Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3', 'bbq_plans.wav', 'product_names.wav']"
      ],
      "metadata": {
        "id": "BzFAqXNdP_3r"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "3. **Verifying of compute resources**\n",
        "\n",
        "    This cell checks for the availability of a GPU, which is crucial for running the Whisper model efficiently. It sets the device to either CPU or GPU based on the availability, ensuring optimal performance.\n",
        "\n",
        "    Using a GPU is the preferred way to use Whisper. If you are using a local machine, you can check if you have a GPU available. The first line results `False`, if Cuda compatible Nvidia GPU is not available and `True` if it is available. The second line of code sets the model to preference GPU whenever it is available."
      ],
      "metadata": {
        "id": "pblcrNgZfK-f"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import torch\n",
        "torch.cuda.is_available()\n",
        "DEVICE = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
        "print(f\"Using torch {torch.__version__} ({DEVICE})\")"
      ],
      "metadata": {
        "id": "K4c5uwE0e5q-"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "4. **Loading the Whisper model**\n",
        "\n",
        "    This section details the process of loading the Whisper model with a specific configuration (in this case, \"medium\") and checking its language support and parameter count. It's essential for understanding the model's capabilities and requirements.\n",
        "\n",
        "    There are several Whisper models available. A list of them is [here](https://github.com/openai/whisper/blob/main/model-card.md). Each model has pros and cons in terms of accuracy and speed (compute needed). We will use the `medium` model for this tutorial."
      ],
      "metadata": {
        "id": "mANIL6_Hf6X2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import whisper\n",
        "model = whisper.load_model(\"medium\", device=DEVICE)\n",
        "print(\n",
        "    f\"Model is {'multilingual' if model.is_multilingual else 'English-only'} \"\n",
        "    f\"and has {sum(np.prod(p.shape) for p in model.parameters()):,} parameters.\"\n",
        ")"
      ],
      "metadata": {
        "id": "iZm7wQ6vfdRQ"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "5. **Setting the Natural Language Toolkit (NLTK) library**\n",
        "\n",
        "    Before processing transcriptions, this part includes setup steps for NLTK, a toolkit in Python for working with human language data. It's used here to enhance the readability of transcriptions by sentence segmentation.\n"
      ],
      "metadata": {
        "id": "qBR5ctMVb4MO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import nltk\n",
        "nltk.download('punkt')\n",
        "from nltk import sent_tokenize"
      ],
      "metadata": {
        "id": "gDZPO6dPr8Ew"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "6. **Transcribing and language detection**\n",
        "\n",
        "    For each file, we first ensure it's in a format Whisper can process efficiently. Then, we leverage Whisper's capability to detect the language of the audio. This step is crucial for setting up our transcription tasks with the correct language, ensuring higher accuracy. We decode the audio and present the transcription in a readable format, sentence by sentence.\n",
        "\n",
        "    **Do you need to set `fp16` true or false?**\n",
        "    - **On GPU:** Setting `fp16` to `True` can be beneficial if your GPU supports half-precision computations. This is often the case with modern NVIDIA GPUs that support Tensor Cores (e.g., Volta, Turing, Ampere, and newer architectures). It can reduce memory usage and speed up model inference.\n",
        "    - **On CPU:** You should set `fp16` to `False`, as `fp16` computations are generally not supported or not efficient on CPUs. This avoids compatibility issues and ensures your code runs smoothly.\n",
        "\n",
        "    Using this dynamic approach allows your script to automatically adjust to the execution environment, making it more versatile and user-friendly, especially if you plan to run it across different hardware setups."
      ],
      "metadata": {
        "id": "cnHm66Gpcaab"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "for audiofile in audiofiles:\n",
        "    # Load audio and pad/trim it to fit 30 seconds\n",
        "    audio = whisper.load_audio(audiofile)\n",
        "    audio = whisper.pad_or_trim(audio)\n",
        "    # Make log-Mel spectrogram and move to the same device as the model\n",
        "    mel = whisper.log_mel_spectrogram(audio).to(model.device)\n",
        "    #Next we detects the language of your audio file\n",
        "    _, probs = model.detect_language(mel)\n",
        "    detected_language = max(probs, key=probs.get)\n",
        "    print(f\"----\\nDetected language: {detected_language}\")\n",
        "    # Set up the decoding options\n",
        "    options = whisper.DecodingOptions(language=detected_language, without_timestamps=True, fp16=(DEVICE == \"cuda\"))\n",
        "    # Decode the audio and print the recognized text\n",
        "    result = whisper.decode(model, mel, options)\n",
        "    print(\"Transcription of file '\" + audiofile + \"':\")\n",
        "    for sent in sent_tokenize(result.text):\n",
        "        print(sent)"
      ],
      "metadata": {
        "id": "8jxdUn0aP5_M"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "7. **Defining a transcription function**\n",
        "\n",
        "    To streamline our workflow, we define a function that handles the loading of audio files, sets transcription and translation options, and performs the transcription. This function is versatile, allowing us to specify whether we want translation alongside transcription, making it a handy tool for processing multiple files."
      ],
      "metadata": {
        "id": "d7T7c8Zmcikd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def process_file(audiofile, model, w_options, w_translate=False):\n",
        "    # Load audio\n",
        "    audio = whisper.load_audio(audiofile)\n",
        "    transcribe_options = dict(task=\"transcribe\", **w_options)\n",
        "    translate_options = dict(task=\"translate\", **w_options)\n",
        "    transcription = model.transcribe(audiofile, **transcribe_options)[\"text\"]\n",
        "    # Evaluates whether a translation is requested\n",
        "    if w_translate:\n",
        "        translation = model.transcribe(audiofile, **translate_options)[\"text\"]\n",
        "    else:\n",
        "        translation = \"N/A\"\n",
        "\n",
        "    return transcription, translation"
      ],
      "metadata": {
        "id": "gs5XatDs2VQk"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "8. **Transcribing non-English audio with optional translation**\n",
        "\n",
        "    Here, we apply our transcription function to a Spanish audio sample, demonstrating Whisper's multilingual capabilities. We show how to obtain both the transcription and, optionally, the translation, showcasing the model's utility in handling diverse languages.\n",
        "    \n",
        "    We introduce an interactive element that allows you to listen to the audio files directly within the notebook. This feature enhances the user experience, providing an immediate sense of the audio content being transcribed and translated."
      ],
      "metadata": {
        "id": "ENGzZ3oOctD3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "w_options = dict(without_timestamps=True, fp16=(DEVICE == \"cuda\"))\n",
        "audiofile = 'Learn_OAI_Whisper_Spanish_Sample_Audio01.mp3'\n",
        "transcription, translation = process_file(audiofile, model, w_options, False)\n",
        "\n",
        "print(\"------\\nTranscription of file '\" + audiofile + \"':\")\n",
        "for sent in sent_tokenize(transcription):\n",
        "    print(sent)\n",
        "print(\"------\\nTranslation of file '\" + audiofile + \"':\")\n",
        "for sent in sent_tokenize(translation):\n",
        "    print(sent)\n",
        "\n",
        "import ipywidgets as widgets\n",
        "widgets.Audio.from_file(audiofile, autoplay=False, loop=False)"
      ],
      "metadata": {
        "id": "h_oUto7xJ1Te"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "9. **Enhancing transcription with custom options**\n",
        "\n",
        "    We explore how to tailor the transcription process further using custom options, such as controlling the decoding temperature and providing an initial prompt. This allows us to fine-tune the transcription results, adapting to specific audio content like product names or technical terminology."
      ],
      "metadata": {
        "id": "RNrS1OZGc3nq"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "9.1. **Using a spelling guide**\n",
        "\n",
        "Let's tackle a common challenge when working with Whisper: ensuring it accurately transcribes uncommon proper nouns, such as product names, company names, or individuals. These elements can often trip up even the most sophisticated transcription tools, leading to misspellings."
      ],
      "metadata": {
        "id": "1QtycW-DigcH"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "w_options = dict(without_timestamps=True, fp16=(DEVICE == \"cuda\"), temperature=0, initial_prompt=\"\")\n",
        "audiofile = 'product_names.wav'\n",
        "transcription, translation = process_file(audiofile, model, w_options)\n",
        "\n",
        "print(\"------\\nTranscription of file '\" + audiofile + \"':\")\n",
        "for sent in sent_tokenize(transcription):\n",
        "    print(sent)\n",
        "\n",
        "import ipywidgets as widgets\n",
        "widgets.Audio.from_file(audiofile, autoplay=False, loop=False)"
      ],
      "metadata": {
        "id": "nP2w21IoTB8-"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "To guide Whisper towards our preferred spellings, we'll include these names directly in the prompt, essentially creating a glossary for Whisper to reference."
      ],
      "metadata": {
        "id": "ZocdhvfOjrXm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "w_options = dict(without_timestamps=True, fp16=(DEVICE == \"cuda\"), temperature=0, initial_prompt=\"Quirk Quid Quill Inc., P3-Quattro, O3-Omni, B3-BondX, E3-Equity, W3-WrapZ, O2-Outlier, U3-UniFund, M3-Mover\")\n",
        "audiofile = 'product_names.wav'\n",
        "transcription, translation = process_file(audiofile, model, w_options)\n",
        "\n",
        "print(\"------\\nTranscription of file '\" + audiofile + \"':\")\n",
        "for sent in sent_tokenize(transcription):\n",
        "    print(sent)"
      ],
      "metadata": {
        "id": "WCjxOqnxQQem"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**9.2 Prompting for transcript generation**\n",
        "\n",
        "Next, we'll pivot to a different audio clip crafted specifically for this exercise. The scenario? An unusual barbecue event. Our first step involves generating a baseline transcript with Whisper to assess its initial accuracy."
      ],
      "metadata": {
        "id": "5y3GH6erjtaN"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "w_options = dict(without_timestamps=True, fp16=(DEVICE == \"cuda\"), temperature=0, initial_prompt=\"\")\n",
        "audiofile = 'bbq_plans.wav'\n",
        "transcription, translation = process_file(audiofile, model, w_options)\n",
        "\n",
        "print(\"------\\nTranscription of file '\" + audiofile + \"':\")\n",
        "for sent in sent_tokenize(transcription):\n",
        "    print(sent)\n",
        "\n",
        "import ipywidgets as widgets\n",
        "widgets.Audio.from_file(audiofile, autoplay=False, loop=False)"
      ],
      "metadata": {
        "id": "206ZmGBFXApN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "While Whisper does an impressive job with transcription, it sometimes has to make educated guesses about spellings. Take, for instance, it rendering the names as Amy and Sean, when in fact, they're spelled Aimee and Shawn. Our goal now is to see if we can influence these spellings through a carefully crafted prompt."
      ],
      "metadata": {
        "id": "YC8L4YRpkTuA"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "w_options = dict(without_timestamps=True, fp16=(DEVICE == \"cuda\"), temperature=0, initial_prompt=\"Friends: Aimee, Shawn\")\n",
        "audiofile = 'bbq_plans.wav'\n",
        "transcription, translation = process_file(audiofile, model, w_options)\n",
        "\n",
        "print(\"------\\nTranscription of file '\" + audiofile + \"':\")\n",
        "for sent in sent_tokenize(transcription):\n",
        "    print(sent)"
      ],
      "metadata": {
        "id": "PnA8yY1DXApO"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now, let's apply \"prompting for transcript generation\" convert instructions into fictitious transcripts for Whisper to emulate. Using that technique to other words with tricky spellings, we ensure our transcript is as accurate as possible in every detail."
      ],
      "metadata": {
        "id": "3JTAciKrkV1C"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "w_options = dict(without_timestamps=True, fp16=(DEVICE == \"cuda\"), temperature=0, initial_prompt=\"\"\"\"Aimee and Shawn had whisky, doughnuts, omelets at a BBQ.\"\"\")\n",
        "audiofile = 'bbq_plans.wav'\n",
        "transcription, translation = process_file(audiofile, model, w_options)\n",
        "\n",
        "print(\"------\\nTranscription of file '\" + audiofile + \"':\")\n",
        "for sent in sent_tokenize(transcription):\n",
        "    print(sent)"
      ],
      "metadata": {
        "id": "_35FCBquYSv2"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}